An Image is Worth More Than a Thousand Words: Towards Disentanglement in The Wild
Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative approach, recent methods rely on limited supervision to disentangle the factors of variation and enable their identifiability. While annotating the true generative factors is only required for a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors which are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, motivates leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g., human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.
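As a rough illustration of the zero-shot annotation step, here is a minimal sketch using the open-source CLIP package; the attribute list and prompt wording are our assumptions, not the paper's exact setup:

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical binary attributes of interest for face images.
attributes = {
    "smiling": ["a photo of a smiling person",
                "a photo of a person with a neutral expression"],
    "glasses": ["a photo of a person wearing glasses",
                "a photo of a person without glasses"],
}

image = preprocess(Image.open("face.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    for name, prompts in attributes.items():
        text = clip.tokenize(prompts).to(device)
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0)
        # probs[0] is the score for the positive prompt: a soft zero-shot label.
        print(f"{name}: {probs[0].item():.2f}")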
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning--a fundamental component of human cognition--remains under-explored. We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided.
Not Every Image is Worth a Thousand Words: Quantifying Originality in Stable Diffusion
Adi Haviv, Shahar Sarfaty, Uri Hacohen, Niva Elkin-Koren, Roi Livni, Amit H. Bermano
We begin by evaluating T2I models' ability to innovate and generalize through controlled experiments, revealing that stable diffusion models can effectively recreate unseen elements with sufficiently diverse training data. Then, our key insight is that concepts and combinations of image elements the model is familiar with, and saw more often during training, are more concisely represented in the model's latent space. We hence propose a method that leverages textual inversion to measure the originality of an image based on the number of tokens required for its reconstruction by the model. Our approach is inspired by legal definitions of originality and aims to assess whether a model can produce original content without relying on specific prompts or having the training data memorized.
Figure 1: Illustration of our approach for measuring image originality using multi-token textual inversion. Original images require more tokens for accurate reconstruction, while common images like Van Gogh's "Starry Night" need only one token.
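A minimal sketch of the token-count originality measure, as we read it from the abstract (not the authors' released code); train_textual_inversion, generate_from_embeddings, and reconstruction_error are hypothetical helpers standing in for a full textual-inversion optimization loop such as the one shipped with Hugging Face diffusers:

# Hypothetical sketch: how many learned pseudo-tokens does a frozen T2I
# model need before it can reconstruct the target image?
def originality_score(image, pipeline, max_tokens=8, error_threshold=0.1):
    """Return the smallest number of pseudo-tokens whose optimized
    embeddings reconstruct `image` within `error_threshold`.
    More tokens needed => less familiar to the model => more original."""
    for k in range(1, max_tokens + 1):
        # Optimize k new token embeddings (textual inversion) so the prompt
        # "<tok_1> ... <tok_k>" regenerates the target image.
        embeddings = train_textual_inversion(pipeline, image, num_tokens=k)  # hypothetical helper
        reconstruction = generate_from_embeddings(pipeline, embeddings)      # hypothetical helper
        if reconstruction_error(reconstruction, image) < error_threshold:    # hypothetical helper
            return k  # common images ("Starry Night") should stop at k=1
    return max_tokens  # not reconstructable even with many tokens: highly original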
A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv Leviathan
Text-to-image diffusion models achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web, and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g. FID 14.84 vs. the baseline of 17.87, and 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment, e.g. semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
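As a rough sketch of the recaptioning step: BLIP below is our stand-in assumption (the paper uses its own specialized captioning model), and the checkpoint name and file paths are illustrative:

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed stand-in captioner; not RECAP's actual captioning model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def recaption(image_path: str) -> str:
    """Replace a noisy HTML alt-text caption with a generated caption."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=64)
    return processor.decode(out[0], skip_special_tokens=True)

# The recaptioned (image, caption) pairs then replace the web alt-text pairs
# when training or fine-tuning the text-to-image model.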
A Picture is Worth a Thousand Words: This Microsoft Model can Generate Images from Short Texts
Humans build knowledge in images. Every time we are presented with an idea or an experience, our brain immediately formulates visual representations of it.
Your Photo Of A Burrito Is Now Worth A Thousand Words
That burrito in your hands--so warm, so gooey, the richness cut by cilantro and red-hot spice. Before you take a bite, you'd better take a picture. Multiply that impulse by tens of thousands and you get Yelp's database of images, drawn from burrito joints, cocktail bars, and more. Until recently, Yelp was dependent on users to tag their images with search-friendly metadata. But now, using the kind of deep learning techniques that are transforming the field of AI, Yelp is starting to see the business benefits of using software intelligence to power its listing pages and user recommendations.